In this tutorial we will mainly use the Seurat package with publicly available datasets. Extensive tutorials covering various contexts can be found at https://satijalab.org/seurat/.
In the first part, we analyze a single-cell RNA-seq dataset produced by 10X Genomics and processed through the Cell Ranger(TM) pipeline to generate barcode count matrices.
Please download the “4k Peripheral blood mononuclear cells (PBMCs) from a Healthy Donor” data from ifx:/data/pub/bionano2018/scRNAseqWS.zip and unzip it.
The data originate from http://cf.10xgenomics.com/samples/cell-exp/2.1.0/pbmc4k/pbmc4k_filtered_gene_bc_matrices.tar.gz: 4k PBMCs from a Healthy Donor*, a Single Cell Gene Expression dataset produced by Cell Ranger 2.1.0 with GRCh38.
4,340 cells detected. Sequenced on an Illumina HiSeq 4000 with approximately 87,000 reads per cell: 26bp read1 (16bp Chromium barcode and 10bp UMI), 98bp read2 (transcript), and 8bp I7 sample barcode. Analysis run with --expect-cells=5000. Published on November 8, 2017.
*This dataset is licensed under the Creative Commons Attribution license.
The Seurat package provides a function for reading 10X datasets from a directory. This directory contains a matrix file (matrix.mtx), which stores UMI counts per gene for every cell in sparse matrix format; a barcode file (barcodes.tsv), which holds the actual barcode sequence assigned to each cell; and a gene file (genes.tsv), which maps gene ids to symbols from the transcriptome annotation.
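To make the file layout concrete, the same matrix can be assembled by hand with the Matrix package. This is only a sketch of what Read10X does internally, using the download location that appears later in the tutorial:

```r
library(Matrix)

dir <- "~/Downloads/scRNAseqWS/10Xdata/filtered_gene_bc_matrices/GRCh38/"

counts   <- readMM(file.path(dir, "matrix.mtx"))       # sparse genes x cells UMI counts
barcodes <- readLines(file.path(dir, "barcodes.tsv"))  # one cell barcode per line
genes    <- read.delim(file.path(dir, "genes.tsv"),
                       header = FALSE)                 # column 1: gene id, column 2: symbol

rownames(counts) <- make.unique(as.character(genes$V2))  # symbols as row names
colnames(counts) <- barcodes
dim(counts)
```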
library(Matrix)
library(Seurat)
library(dplyr)
library(SIMLR)
The Read10X function reads these input files and merges them into a single sparse expression matrix.
pbmc.data <- Read10X(data.dir = "~/Downloads/scRNAseqWS/10Xdata/filtered_gene_bc_matrices/GRCh38/")
We then create a Seurat object from the data matrix we just generated. The raw data is stored in the ‘raw.data’ slot of the Seurat object (pbmc@raw.data).
Filter cells and/or genes:
min.cells = 3: keep all genes expressed in >= 3 cells.
min.genes = 200: keep all cells with at least 200 detected genes.
pbmc <- CreateSeuratObject(raw.data = pbmc.data, min.cells = 3, min.genes = 200, project = "10X_PBMC")
One important quality measure is the proportion of mitochondrial gene expression to overall expression, because it gives us an idea of how well other transcripts were sampled. The code below calculates it for each cell and adds it to the Seurat object metadata.
mito.genes <- grep(pattern = "^MT-", x = rownames(x = pbmc@data), value = TRUE)
percent.mito <- Matrix::colSums(pbmc@raw.data[mito.genes, ])/Matrix::colSums(pbmc@raw.data)
pbmc <- AddMetaData(object = pbmc,
metadata = percent.mito,
col.name = "percent.mito")
head(pbmc@meta.data)
## nGene nUMI orig.ident percent.mito
## AAACCTGAGAAGGCCT 748 1738 10X_PBMC 0.06386651
## AAACCTGAGACAGACC 1052 3240 10X_PBMC 0.05462963
## AAACCTGAGATAGTCA 739 1683 10X_PBMC 0.07367796
## AAACCTGAGCGCCTCA 875 2319 10X_PBMC 0.03839517
## AAACCTGAGGCATGGT 951 2983 10X_PBMC 0.02246061
## AAACCTGCAAGGTTCT 1248 4181 10X_PBMC 0.02224348
We can then check the mitochondrial gene expression rate per cell, as well as two other internally calculated metrics: nUMI (total UMI counts per cell) and nGene (number of genes detected).
VlnPlot(pbmc, features.plot = c("nGene", "nUMI", "percent.mito"))
Using the package function GenePlot, it is possible to compare cells by their nUMI, nGene, and mitochondrial percentage values.
Since a rare subset of cells shows outlier-high mitochondrial percentage and low UMI content, these metrics can also be used for filtering.
ggplot(pbmc@meta.data, aes(nUMI, nGene)) + geom_point()
ggplot(pbmc@meta.data, aes(nUMI, percent.mito)) + geom_point()
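The same pairwise comparisons can be drawn with GenePlot itself, which accepts metadata column names as well as gene names:

```r
# Two scatter plots side by side: nGene vs. nUMI, and percent.mito vs. nUMI
par(mfrow = c(1, 2))
GenePlot(pbmc, gene1 = "nUMI", gene2 = "nGene")
GenePlot(pbmc, gene1 = "nUMI", gene2 = "percent.mito")
par(mfrow = c(1, 1))
```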
pbmc <- FilterCells(pbmc,
subset.names = c("nGene", "percent.mito"),
low.thresholds = c(200, -Inf),
high.thresholds = c(2500, 0.05))
#Check to see if the filtering worked as we expected.
VlnPlot(pbmc, features.plot = c("nGene", "nUMI", "percent.mito"))
The default normalization method is “LogNormalize”, which divides each gene’s expression by the cell’s total expression, multiplies by a scale factor (10,000), and log-transforms the result. The normalized values are stored in the ‘data’ slot of the Seurat object (pbmc@data).
pbmc <- NormalizeData(pbmc,
normalization.method = "LogNormalize",
scale.factor = 10000)
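As a sanity check, the transformation can be reproduced by hand for one cell; the result should match the corresponding column of pbmc@data. A sketch, assuming the default scale factor and that cell totals are computed over the genes retained at object creation:

```r
# Raw UMI counts of the first cell, restricted to the genes kept in the object
cell <- colnames(pbmc@data)[1]
raw  <- pbmc@raw.data[rownames(pbmc@data), cell]

# LogNormalize by hand: counts / cell total * 10,000, then natural log of (1 + x)
manual <- log1p(raw / sum(raw) * 10000)
all.equal(as.numeric(manual), as.numeric(pbmc@data[, cell]))
```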
Here, the ‘FindVariableGenes’ function calculates the variance and mean of every gene across all cells and sorts genes by their variance-to-mean ratio (VMR). We then select the top 1,000 genes as highly variable genes.
pbmc <- FindVariableGenes(pbmc,
mean.function = ExpMean,
dispersion.function = LogVMR,
do.plot = FALSE,
display.progress = FALSE)
hv.genes <- head(rownames(pbmc@hvg.info), 1000)
The scaling function, ScaleData, is used to scale and center the expression values of each gene. This function also gives us an opportunity to regress out unwanted variation from known sources (linear regression). Keep in mind that since downstream analyses such as dimension reduction use only the highly variable genes, we can scale the data using only ‘hv.genes’. The scaled expression values are then stored in the ‘scale.data’ slot of the object (pbmc@scale.data).
pbmc <- ScaleData(pbmc,
genes.use = hv.genes,
vars.to.regress = c("nUMI", "percent.mito"),
display.progress = FALSE)
Here, we are performing a Principal Component Analysis (PCA) on the normalized and scaled expression data using highly variable genes.
pbmc <- RunPCA(pbmc,
pc.genes = hv.genes,
do.print = FALSE)
The first plot shows the first and second principal components.
PCAPlot(pbmc, dim.1 = 1, dim.2 = 2)
This second plot shows the standard deviation explained by each PC. We will include PCs up to the point where the curve makes a kink (the “elbow”).
PCElbowPlot(pbmc)
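The elbow can also be located numerically from the standard deviations stored in the PCA reduction slot (a sketch based on the Seurat v2 object layout):

```r
# Standard deviation of each computed PC (the same values PCElbowPlot draws)
sdev <- pbmc@dr$pca@sdev

# Fraction of variance each PC explains, relative to the PCs that were computed
var.explained <- sdev^2 / sum(sdev^2)
round(cumsum(var.explained), 3)
```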
Cell clustering is done using a shared nearest neighbor (SNN) method: first find the k-nearest neighbors of every cell, then construct an SNN graph. For more information about the algorithm, read Waltman and van Eck (2013), The European Physical Journal B. Each cell (labelled with its barcode sequence) is assigned to a cluster id:
pbmc <- FindClusters(pbmc,
reduction.type = "pca",
dims.use = 1:10,
resolution = 0.6,
print.output = TRUE,
save.SNN = TRUE)
head(pbmc@ident,20)
## AAACCTGAGCGCCTCA AAACCTGAGGCATGGT AAACCTGCAAGGTTCT AAACCTGCATCCCATC
## 4 0 4 1
## AAACCTGCATGAAGTA AAACCTGGTACATCCA AAACCTGGTGCGGTAA AAACCTGTCGTGGTCG
## 2 1 0 0
## AAACCTGTCTCTGCTG AAACGGGAGGCTAGCA AAACGGGAGTGTCCAT AAACGGGGTCTTCTCG
## 4 4 5 2
## AAACGGGGTGGACGAT AAACGGGGTTTGCATG AAACGGGTCTGGGCCA AAACGGGTCTGGTATG
## 3 2 5 4
## AAAGATGAGACATAAC AAAGATGGTCCCTACT AAAGATGTCCGAATGT AAAGATGTCTGGTTCC
## 4 1 5 2
## Levels: 0 1 2 3 4 5 6 7 8 9
For visualizing the clustered cells, we use a t-SNE plot. When running t-SNE, we should use the same PCs as before, so that the embedding reflects the same structure used for clustering.
pbmc <- RunTSNE(pbmc, dims.use = 1:10, do.fast = TRUE)
TSNEPlot(pbmc, do.label = TRUE)
Finally, we determine differentially expressed genes unique to each cluster. The ‘FindAllMarkers’ function takes the expression of each gene in one cluster and compares it against all other clusters. By default the statistical test is the Wilcoxon rank-sum test; however, there are multiple other options, including DESeq2. Here we apply further constraints, e.g. ‘min.pct = 0.25’, meaning that only genes expressed in at least 25% of the cells in the cluster are tested.
pbmc.markers <- FindAllMarkers(pbmc,
only.pos = TRUE,
min.pct = 0.25,
logfc.threshold = 0.25)
top5.markers <- pbmc.markers %>% group_by(cluster) %>% top_n(5, avg_logFC)
best.markers <- pbmc.markers %>% group_by(cluster) %>% top_n(1, avg_logFC)
FeaturePlot(object = pbmc,
features.plot = best.markers$gene,
cols.use = c("grey", "blue"),
reduction.use = "tsne")
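The top markers per cluster can also be summarized as an expression heatmap with DoHeatmap, here using the top-5 table computed above:

```r
DoHeatmap(pbmc,
          genes.use = top5.markers$gene,
          slim.col.label = TRUE,
          remove.key = TRUE)
```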
Alignment workflow for the four mouse brain datasets
Zeisel et al.: Single-cell RNA-seq of mouse cerebral cortex
Tasic et al.: Adult mouse cortical cell taxonomy by single cell transcriptomics
Romanov et al.: Single-cell RNA-seq of mouse hypothalamus
Marques et al.: RNA-seq analysis of single cells of the oligodendrocyte lineage from nine distinct regions of the anterior-posterior and dorsal-ventral axis of the mouse juvenile central nervous system
The raw sequencing reads were downloaded from the public SRA database and processed with a Kallisto quantification pipeline.
We start by loading the expression data and the metadata of each study.
ob.list <- list("zeisel", "romanov", "tasic", "marques")
#Load the expression and meta data of each study.
for (i in 1:length(ob.list)){
obj.data <- paste(ob.list[[i]],".data",sep="");
#Read the expression matrix from a text file for each dataset.
assign(obj.data, read.delim(paste("~/Downloads/scRNAseqWS/",ob.list[[i]],".expression.txt",sep=""), header = TRUE, row.names = 1))
}
#Since the expression matrices of these datasets are in TPM (already normalized), we skip the NormalizeData step below. However, we still need to log-transform: log1p(x) = log(1 + x), natural log.
zeisel.data <- log1p(zeisel.data)
romanov.data <- log1p(romanov.data)
tasic.data <- log1p(tasic.data)
marques.data <- log1p(marques.data)
for (i in 1:length(ob.list)){
obj.meta <- paste(ob.list[[i]],".meta",sep="");
#Reading the Run information meta data from a text file for each dataset.
assign(obj.meta, read.delim(paste("~/Downloads/scRNAseqWS/",ob.list[[i]],".RunTable.txt",sep=""), header = TRUE))
}
Check the PCA and tSNE prior to alignment:
rownames(zeisel.meta) <- zeisel.meta$Run_s
rownames(romanov.meta) <- romanov.meta$Run_s
rownames(tasic.meta) <- tasic.meta$Run_s
rownames(marques.meta) <- marques.meta$Run_s
batches <- rbind(zeisel.meta[,c("Run_s","Owner","SRA_Study_s")],
tasic.meta[,c("Run_s","Owner","SRA_Study_s")],
romanov.meta[,c("Run_s","Owner","SRA_Study_s")],
marques.meta[,c("Run_s","Owner","SRA_Study_s")])
combined.data <- cbind(zeisel.data, tasic.data, romanov.data, marques.data)
combined.data <- as(as.matrix(combined.data), "dgCMatrix")
fourDataset <- CreateSeuratObject(raw.data = combined.data,
project = "4dataset.Pre")
fourDataset <- AddMetaData(fourDataset, metadata = batches)
fourDataset <- FilterCells(fourDataset,
subset.names = "nGene",
low.thresholds = 2500)
fourDataset <- FindVariableGenes(fourDataset,
do.plot = F,
display.progress = F)
fourDataset <- ScaleData(fourDataset, display.progress = F)
## NormalizeData has not been run, therefore ScaleData is running on non-normalized values. Recommended workflow is to run NormalizeData first.
## ScaleData is running on non-normalized values. Recommended workflow is to run NormalizeData first.
fourDataset <- RunPCA(fourDataset,
pc.genes = fourDataset@var.genes,
pcs.compute = 5,
do.print = FALSE)
PCAPlot(fourDataset, pt.size=1, group.by ="Owner", dim.1 = 1, dim.2 = 2)
fourDataset <- RunTSNE(fourDataset,
reduction.use = "pca",
dims.use = 1:5)
TSNEPlot(fourDataset, do.label = T, group.by ="Owner")
#Subset the data
zeisel <- SubsetData(fourDataset,
cells.use=names(zeisel.data), do.center=T, do.scale=T)
## NormalizeData has not been run, therefore ScaleData is running on non-normalized values. Recommended workflow is to run NormalizeData first.
## ScaleData is running on non-normalized values. Recommended workflow is to run NormalizeData first.
## Scaling data matrix
tasic <- SubsetData(fourDataset,
cells.use=names(tasic.data), do.center=T, do.scale=T)
## NormalizeData has not been run, therefore ScaleData is running on non-normalized values. Recommended workflow is to run NormalizeData first.
## ScaleData is running on non-normalized values. Recommended workflow is to run NormalizeData first.
## Scaling data matrix
romanov <- SubsetData(fourDataset,
cells.use=names(romanov.data), do.center=T, do.scale=T)
## NormalizeData has not been run, therefore ScaleData is running on non-normalized values. Recommended workflow is to run NormalizeData first.
## ScaleData is running on non-normalized values. Recommended workflow is to run NormalizeData first.
## Scaling data matrix
marques <- SubsetData(fourDataset,
cells.use=names(marques.data), do.center=T, do.scale=T)
## NormalizeData has not been run, therefore ScaleData is running on non-normalized values. Recommended workflow is to run NormalizeData first.
## ScaleData is running on non-normalized values. Recommended workflow is to run NormalizeData first.
## Scaling data matrix
Determine genes to use for CCA; they must be highly variable in at least 2 datasets:
ob.list <- list(zeisel, romanov, tasic, marques)
genes.use <- c()
for (i in 1:length(ob.list)) {
genes.use <- c(genes.use, head(rownames(ob.list[[i]]@hvg.info), 1000))
}
genes.use <- names(which(table(genes.use) > 1))
for (i in 1:length(ob.list)) {
genes.use <- genes.use[genes.use %in% rownames(ob.list[[i]]@scale.data)]
}
Run multi-set CCA
mouseBrain.integrated <- RunMultiCCA(ob.list, genes.use = genes.use, num.ccs = 15)
## [1] "Computing CC 1"
## [1] "Computing CC 2"
## [1] "Computing CC 3"
## [1] "Computing CC 4"
## [1] "Computing CC 5"
## [1] "Computing CC 6"
## [1] "Computing CC 7"
## [1] "Computing CC 8"
## [1] "Computing CC 9"
## [1] "Computing CC 10"
## [1] "Computing CC 11"
## [1] "Computing CC 12"
## [1] "Computing CC 13"
## [1] "Computing CC 14"
## [1] "Computing CC 15"
## Scaling data matrix
Run the rare non-overlapping filtering step:
mouseBrain.integrated <- CalcVarExpRatio(mouseBrain.integrated,
reduction.type = "pca",
grouping.var = "Owner",
dims.use = 1:10)
mouseBrain.integrated <- SubsetData(mouseBrain.integrated,
subset.name = "var.ratio.pca",
accept.low = 0.5)
Alignment:
mouseBrain.integrated <- AlignSubspace(mouseBrain.integrated,
reduction.type = "cca",
grouping.var = "Owner",
dims.align = 1:10)
t-SNE and Clustering
mouseBrain.integrated <- FindClusters(mouseBrain.integrated,
reduction.type = "cca.aligned",
dims.use = 1:10, save.SNN = T,
resolution = 0.4)
mouseBrain.integrated <- RunTSNE(mouseBrain.integrated,
reduction.use = "cca.aligned",
dims.use = 1:10,
check_duplicates = FALSE)
# Visualization
TSNEPlot(mouseBrain.integrated, do.label = T)
TSNEPlot(mouseBrain.integrated, do.label = T, group.by ="Owner")
library(SIMLR)
set.seed(11111)
#zeisel.data is already log-transformed.
# Determine the optimal number of clusters as described in the Nat. Methods paper.
# Pick a cluster range; the function reports two metrics. The lower the value,
# the more support for that number of clusters; in my limited experience these
# metrics are concordant.
zclust<-SIMLR_Estimate_Number_of_Clusters(zeisel.data, NUMC=2:5)
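The two heuristics can then be inspected directly. A sketch, assuming the returned list has components K1 and K2 (as in SIMLR 1.6, each holding one value per candidate in NUMC):

```r
# Tabulate the two heuristics for each candidate cluster number (NUMC = 2:5);
# lower values indicate more support.
data.frame(clusters = 2:5, K1 = zclust$K1, K2 = zclust$K2)

# Candidate with the strongest support according to the first metric
best.k <- (2:5)[which.min(zclust$K1)]
best.k
```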
#Run SIMLR with 4 clusters.
zsimlr<-SIMLR(zeisel.data, 4)
## Computing the multiple Kernels.
## Performing network diffiusion.
## Iteration: 1
## Iteration: 2
## Iteration: 3
## Iteration: 4
## Iteration: 5
## Iteration: 6
## Iteration: 7
## Iteration: 8
## Iteration: 9
## Iteration: 10
## Iteration: 11
## Iteration: 12
## Iteration: 13
## Performing t-SNE.
## Epoch: Iteration # 100 error is: 0.1880229
## Epoch: Iteration # 200 error is: 0.1628646
## Epoch: Iteration # 300 error is: 0.1566613
## Epoch: Iteration # 400 error is: 0.1535306
## Epoch: Iteration # 500 error is: 0.1516693
## Epoch: Iteration # 600 error is: 0.1503478
## Epoch: Iteration # 700 error is: 0.1492706
## Epoch: Iteration # 800 error is: 0.1484344
## Epoch: Iteration # 900 error is: 0.1477013
## Epoch: Iteration # 1000 error is: 0.1470672
## Performing Kmeans.
## Performing t-SNE.
## Epoch: Iteration # 100 error is: 12.09497
## Epoch: Iteration # 200 error is: 0.2800904
## Epoch: Iteration # 300 error is: 0.2665204
## Epoch: Iteration # 400 error is: 0.262567
## Epoch: Iteration # 500 error is: 0.2604442
## Epoch: Iteration # 600 error is: 0.2590736
## Epoch: Iteration # 700 error is: 0.258107
## Epoch: Iteration # 800 error is: 0.2573748
## Epoch: Iteration # 900 error is: 0.256798
## Epoch: Iteration # 1000 error is: 0.2563259
# Create plotting function, color-coding points by cluster membership
# (hard-coded for up to 4 clusters)
plotSIMLRclusters <- function(obj) {
col <- ifelse(obj$y$cluster==1, 'red', ifelse(obj$y$cluster==2, 'blue',
ifelse(obj$y$cluster==3, 'green','yellow')))
plot(obj$ydata, col=col, xlab = "SIMLR component 1", ylab = "SIMLR component 2", pch=20, cex=0.7)
}
# Call plotting function
plotSIMLRclusters(zsimlr)
sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin17.5.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
##
## Matrix products: default
## BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] bindrcpp_0.2.2 SIMLR_1.6.0 dplyr_0.7.6 Seurat_2.3.3
## [5] cowplot_0.9.3 ggplot2_3.0.0 Matrix_1.2-14
##
## loaded via a namespace (and not attached):
## [1] diffusionMap_1.1-0 Rtsne_0.13 colorspace_1.3-2
## [4] class_7.3-14 modeltools_0.2-22 ggridges_0.5.0
## [7] mclust_5.4.1 rprojroot_1.3-2 htmlTable_1.12
## [10] base64enc_0.1-3 rstudioapi_0.7 proxy_0.4-22
## [13] flexmix_2.3-14 bit64_0.9-7 RSpectra_0.13-1
## [16] mvtnorm_1.0-8 codetools_0.2-15 splines_3.5.0
## [19] R.methodsS3_1.7.1 robustbase_0.93-1 knitr_1.20
## [22] Formula_1.2-3 jsonlite_1.5 ica_1.0-2
## [25] cluster_2.0.7-1 kernlab_0.9-26 png_0.1-7
## [28] R.oo_1.22.0 compiler_3.5.0 backports_1.1.2
## [31] assertthat_0.2.0 lazyeval_0.2.1 lars_1.2
## [34] acepack_1.4.1 htmltools_0.3.6 tools_3.5.0
## [37] igraph_1.2.1 gtable_0.2.0 glue_1.2.0
## [40] RANN_2.6 reshape2_1.4.3 Rcpp_0.12.17
## [43] trimcluster_0.1-2 gdata_2.18.0 ape_5.1
## [46] nlme_3.1-137 iterators_1.0.10 fpc_2.1-11
## [49] lmtest_0.9-36 stringr_1.3.1 irlba_2.3.2
## [52] gtools_3.8.1 DEoptimR_1.0-8 MASS_7.3-50
## [55] zoo_1.8-3 scales_0.5.0 doSNOW_1.0.16
## [58] parallel_3.5.0 RColorBrewer_1.1-2 yaml_2.1.19
## [61] reticulate_1.9 pbapply_1.3-4 gridExtra_2.3
## [64] rpart_4.1-13 segmented_0.5-3.0 latticeExtra_0.6-28
## [67] stringi_1.2.3 foreach_1.4.4 checkmate_1.8.5
## [70] caTools_1.17.1 SDMTools_1.1-221 rlang_0.2.1
## [73] pkgconfig_2.0.1 dtw_1.20-1 prabclus_2.2-6
## [76] bitops_1.0-6 pracma_2.1.4 evaluate_0.10.1
## [79] lattice_0.20-35 ROCR_1.0-7 purrr_0.2.5
## [82] bindr_0.1.1 labeling_0.3 htmlwidgets_1.2
## [85] bit_1.1-14 tidyselect_0.2.4 RcppAnnoy_0.0.10
## [88] plyr_1.8.4 magrittr_1.5 R6_2.2.2
## [91] snow_0.4-2 gplots_3.0.1 Hmisc_4.1-1
## [94] pillar_1.3.0 foreign_0.8-70 withr_2.1.2
## [97] fitdistrplus_1.0-9 mixtools_1.1.0 survival_2.42-6
## [100] scatterplot3d_0.3-41 nnet_7.3-12 tibble_1.4.2
## [103] tsne_0.1-3 crayon_1.3.4 hdf5r_1.0.0
## [106] KernSmooth_2.23-15 rmarkdown_1.10 grid_3.5.0
## [109] data.table_1.11.4 metap_0.9 digest_0.6.15
## [112] diptest_0.75-7 tidyr_0.8.1 R.utils_2.6.0
## [115] stats4_3.5.0 munsell_0.5.0